PoYu Lin
2024-08-14
The first six months after a stroke are the golden period for rehabilitation, during which motor abilities recover the fastest. Timely access to rehabilitation services during this window is therefore critical for patients.
This project uses the “Rate of Stroke Patients Receiving Rehabilitation Services During Hospitalization or Within Four Months After Discharge” dataset, published by the Taiwanese government.
By combining the geographic data, hospital levels, and time information within the dataset, I analyze the distribution of medical resources for stroke rehabilitation in Taiwan.
The original dataset, “Rate of Stroke Patients Receiving Rehabilitation Services During Hospitalization or Within Four Months After Discharge”, can be downloaded from the link below:
https://data.gov.tw/dataset/79568
Since the season information was originally recorded in Chinese, I have translated it and uploaded the pre-processed dataset to GitHub. If your computer cannot read Chinese, please use the following code to load the data that I have already converted to English.
MainData <- read.csv("https://raw.githubusercontent.com/VilcekSummerR/final-assignment-Po-yuLin/main/TaiwanStrokeRehab.csv")
head(MainData)
## Year_Season HospitalID HospitalClass StrokeRehabNumber StrokeNumber
## 1 2016_1 101090517 2 106 185
## 2 2016_2 101090517 2 112 171
## 3 2016_3 101090517 2 133 206
## 4 2016_4 101090517 2 129 177
## 5 2017_1 101090517 2 97 162
## 6 2017_2 101090517 2 100 162
## StrokeRehabRate StrokeRehabRegionRate StrokeRehabCountryRate CityCode
## 1 0.5730 0.6688 0.6686 63000
## 2 0.6550 0.7007 0.6761 63000
## 3 0.6456 0.6887 0.6824 63000
## 4 0.7288 0.6869 0.6773 63000
## 5 0.5988 0.6762 0.6907 63000
## 6 0.6173 0.6849 0.6963 63000
## TownCode
## 1 63000060
## 2 63000060
## 3 63000060
## 4 63000060
## 5 63000060
## 6 63000060
My code for main data pre-processing is listed below. If your computer can read Chinese, you can use the code in this section to download the file from the official source.
#install.packages("dplyr")
library(dplyr)
#Read in the data from the official website of National Health Insurance, Taiwan
MainData <- read.csv("https://info.nhi.gov.tw/api/iode0000s01/Dataset?rId=A21003000I-E3201N-001")
#Remove the Chinese names of the hospitals to avoid display issues on other computers
MainData <- MainData[,-3]
#Rename the columns in English
names(MainData) <- c("Year_Season", "HospitalID", "HospitalClass",
"StrokeRehabNumber", "StrokeNumber",
"StrokeRehabRate", "StrokeRehabRegionRate",
"StrokeRehabCountryRate", "CityCode", "TownCode")
#Reorganize the season information to eliminate Chinese in the table
MainData$Year_Season <- gsub("(\\d{4})年第一季", "\\1_1", MainData$Year_Season)
MainData$Year_Season <- gsub("(\\d{4})年第二季", "\\1_2", MainData$Year_Season)
MainData$Year_Season <- gsub("(\\d{4})年第三季", "\\1_3", MainData$Year_Season)
MainData$Year_Season <- gsub("(\\d{4})年第四季", "\\1_4", MainData$Year_Season)
head(MainData)
The original map dataset of Taiwan, “Administrative Boundary Maps of Municipalities and Counties (Cities) in Our Country,” can be downloaded from the Taiwan Ministry of the Interior Geographic Information Cloud Integration Service Platform.
Since the county/city names were originally recorded in Chinese, I have translated them.
Additionally, the original map covered too wide an area and lacked centroid information for each county/city.
I have addressed these issues and uploaded the pre-processed dataset to GitHub. If your computer cannot read Chinese, please use the following code to load the data that I have already converted to English.
#install.packages("sf")
#install.packages("ggplot2")
library(sf)
library(ggplot2)
download.file("https://github.com/VilcekSummerR/final-assignment-Po-yuLin/raw/main/TaiwanMap_English.zip", destfile = "TaiwanMap_English.zip")
unzip("TaiwanMap_English.zip", exdir = "TaiwanMap_English")
TaiwanMap <- st_read("TaiwanMap_English/COUNTY_MOI_1130718.shp")
## Reading layer `COUNTY_MOI_1130718' from data source
## `C:\NYU Biomedical Informatics\01R Programming for Data Analysis\Final project\TaiwanMap_English\COUNTY_MOI_1130718.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 22 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 118.138 ymin: 21.75614 xmax: 123.6893 ymax: 26.38528
## Geodetic CRS: GCS_TWD97_2020
ggplot(data = TaiwanMap) +
geom_sf(aes(fill = COUNTYNAME)) +
scale_fill_discrete(name = "County/City") +
geom_text(aes(x = x, y = y, label = COUNTYNAME), size = 2, color = "black") +
theme_minimal() +
theme(legend.position = "right")
My code for map data pre-processing is listed in the following two code chunks. I have uploaded the original map to the project GitHub. If your computer can read Chinese, you can use the code in this section to download the original file from GitHub.
library(sf)
library(ggplot2)
download.file("https://github.com/VilcekSummerR/final-assignment-Po-yuLin/raw/main/TaiwanMap.zip", destfile = "TaiwanMap.zip")
unzip("TaiwanMap.zip", exdir = "TaiwanMap")
TaiwanMap <- st_read("TaiwanMap/COUNTY_MOI_1130718.shp")
ggplot(data = TaiwanMap) +
geom_sf(aes(fill = COUNTYNAME)) +
scale_fill_discrete(name = "County/City") +
theme_minimal() +
theme(legend.position = "right")
The original map covers a wide area, and the county/city names are in Chinese. Therefore, I used the following code to translate them into English and crop the map to the appropriate area. I also calculated the centroid of each county/city.
library(dplyr)
library(sf)
#Check the list of COUNTYNAME and translate them into English
#unique(TaiwanMap$COUNTYNAME)
#Translate Name of County/City into English
TaiwanMap <- TaiwanMap %>% mutate(
COUNTYNAME = recode(COUNTYNAME,
"臺東縣" = "Taitung County",
"屏東縣" = "Pingtung County",
"雲林縣" = "Yunlin County",
"彰化縣" = "Changhua County",
"苗栗縣" = "Miaoli County",
"新竹縣" = "Hsinchu County",
"嘉義縣" = "Chiayi County",
"高雄市" = "Kaohsiung City",
"宜蘭縣" = "Yilan County",
"連江縣" = "Lienchiang County",
"金門縣" = "Kinmen County",
"臺中市" = "Taichung City",
"澎湖縣" = "Penghu County",
"南投縣" = "Nantou County",
"花蓮縣" = "Hualien County",
"基隆市" = "Keelung City",
"臺北市" = "Taipei City",
"新北市" = "New Taipei City",
"臺南市" = "Tainan City",
"桃園市" = "Taoyuan City",
"嘉義市" = "Chiayi City",
"新竹市" = "Hsinchu City"
)
)
#Adjust the map to the appropriate area
#st_bbox(TaiwanMap)
MapShowBox <- st_bbox(c(xmin = 118, xmax = 124, ymin = 21, ymax = 27))
TaiwanMap <- st_crop(TaiwanMap, MapShowBox)
#Calculate the centroid of each county (This was aided by Microsoft Copilot)
TaiwanMap <- TaiwanMap %>% select(COUNTYNAME, COUNTYCODE, geometry) %>%
mutate(centroid = st_centroid(geometry)) %>%
mutate(coords = st_coordinates(centroid)) %>%
mutate(x = coords[, 1], y = coords[, 2]) %>%
select(-centroid, -coords)
ggplot(data = TaiwanMap) +
geom_sf(aes(fill = COUNTYNAME)) +
scale_fill_discrete(name = "County/City") +
geom_text(aes(x = x, y = y, label = COUNTYNAME), size = 2, color = "black") +
theme_minimal() +
theme(legend.position = "right")
#Save the pre-processed map
#st_write(TaiwanMap, "TaiwanMap_English/COUNTY_MOI_1130718.shp")
We can also plot an interactive map with the tmap library.
First, I check the data type of each column using the sapply() function. I change HospitalClass to a factor using recode_factor(), and CityCode to a character vector so it can later be matched against the map's COUNTYCODE. The data types of the other columns are correct.
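The chunk that produced the output below was not echoed; it is a single sapply() call applying class() to every column. A minimal sketch — the toy frame here is illustrative, and in the report the call is simply sapply(MainData, class):

```r
# sapply() applies class() to each column and returns a named
# character vector, which is what the output below shows for MainData.
toy <- data.frame(
  Year_Season = c("2016_1", "2016_2"),      # character, like the raw data
  HospitalID  = c(101090517L, 101090517L),  # integer
  StrokeRehabRate = c(0.5730, 0.6550)       # numeric
)
sapply(toy, class)
```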
## Year_Season HospitalID HospitalClass
## "character" "integer" "integer"
## StrokeRehabNumber StrokeNumber StrokeRehabRate
## "integer" "integer" "numeric"
## StrokeRehabRegionRate StrokeRehabCountryRate CityCode
## "numeric" "numeric" "integer"
## TownCode
## "integer"
MainData$HospitalClass <- recode_factor(MainData$HospitalClass, "1" = "Medical Center", "2" = "Regional Hospital", "3" = "Local Hospital")
MainData$CityCode <- as.character(MainData$CityCode)
sapply(MainData[c("HospitalClass","CityCode")], class)
## HospitalClass CityCode
## "factor" "character"
Second, I divide the first column, Year_Season, into two columns, Year and Season, using separate() from tidyr, and convert both new columns to numeric.
## [1] "2016_1" "2016_2" "2016_3" "2016_4" "2017_1" "2017_2"
library(tidyr)
MainData <- MainData %>% separate(Year_Season, into = c("Year", "Season"), sep = "_")
MainData[c("Year", "Season")] <- lapply(MainData[c("Year", "Season")], as.numeric)
sapply(MainData[c("Year", "Season")], class)
## Year Season
## "numeric" "numeric"
Third, I rename the MainData column CityCode to COUNTYCODE to match the column names in the map file, and I manually fix the three mismatched COUNTYCODE values.
Then, using this shared column, I bring COUNTYNAME from the map into MainData.
library(dplyr)
MainData <- MainData %>% rename("COUNTYCODE" = "CityCode")
#generate a table for COUNTYCODE and COUNTYNAME in map
TaiwanMapUnique <- TaiwanMap %>% st_drop_geometry() %>%
  group_by(COUNTYCODE) %>% summarise(COUNTYNAME = first(COUNTYNAME))
#find the mismatched COUNTYCODE values and replace them with the map's codes
#Mismatch <- setdiff(unique(MainData$COUNTYCODE), TaiwanMapUnique$COUNTYCODE)
#print(Mismatch)
#Mismatch2 <- setdiff(TaiwanMapUnique$COUNTYCODE, unique(MainData$COUNTYCODE))
#print(Mismatch2)
MainData$COUNTYCODE <- gsub("^10021$", "67000", MainData$COUNTYCODE)
MainData$COUNTYCODE <- gsub("^9020$", "09020", MainData$COUNTYCODE)
MainData$COUNTYCODE <- gsub("^9007$", "09007", MainData$COUNTYCODE)
MainData <- left_join(MainData, TaiwanMapUnique, by = "COUNTYCODE")
#Delete unnecessary columns
MainData <- MainData[c("Year", "Season", "HospitalID", "HospitalClass", "StrokeRehabNumber", "StrokeNumber", "StrokeRehabRate", "StrokeRehabRegionRate", "StrokeRehabCountryRate", "COUNTYCODE", "COUNTYNAME")]
summary(MainData)
## Year Season HospitalID HospitalClass
## Min. :2016 Min. :1.000 Min. :1.011e+08 Medical Center : 704
## 1st Qu.:2018 1st Qu.:2.000 1st Qu.:6.020e+08 Regional Hospital:2281
## Median :2020 Median :3.000 Median :1.111e+09 Local Hospital :3326
## Mean :2020 Mean :2.501 Mean :9.584e+08
## 3rd Qu.:2022 3rd Qu.:4.000 3rd Qu.:1.307e+09
## Max. :2023 Max. :4.000 Max. :1.543e+09
## StrokeRehabNumber StrokeNumber StrokeRehabRate StrokeRehabRegionRate
## Min. : 0.00 Min. : 1.00 Min. :0.0000 Min. :0.5760
## 1st Qu.: 3.00 1st Qu.: 4.00 1st Qu.:0.5556 1st Qu.:0.6921
## Median : 13.00 Median : 21.00 Median :0.7200 Median :0.7183
## Mean : 27.82 Mean : 38.86 Mean :0.6696 Mean :0.7172
## 3rd Qu.: 40.00 3rd Qu.: 55.00 3rd Qu.:0.8489 3rd Qu.:0.7438
## Max. :229.00 Max. :265.00 Max. :1.0000 Max. :0.7981
## StrokeRehabCountryRate COUNTYCODE COUNTYNAME
## Min. :0.6686 Length:6311 Length:6311
## 1st Qu.:0.6963 Class :character Class :character
## Median :0.7101 Mode :character Mode :character
## Mean :0.7159
## 3rd Qu.:0.7411
## Max. :0.7581
Finally, I check for missing values in the data; fortunately, there are none.
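The check itself was not echoed. Counting NAs per column with colSums(is.na()) — the same idiom used in the later pre-processing comments — reproduces the table of zeros below; the toy frame here is illustrative, and in the report the call would be colSums(is.na(MainData)):

```r
# colSums(is.na(x)) counts the missing values in each column;
# a named vector of all zeros means the dataset is complete.
toy <- data.frame(Year = c(2016, 2016), StrokeRehabRate = c(0.573, NA))
colSums(is.na(toy))
```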
## Year Season HospitalID
## 0 0 0
## HospitalClass StrokeRehabNumber StrokeNumber
## 0 0 0
## StrokeRehabRate StrokeRehabRegionRate StrokeRehabCountryRate
## 0 0 0
## COUNTYCODE COUNTYNAME
## 0 0
To analyze the distribution of rehabilitation resources by geography, hospital level, and time, I reorganized the data by geography and by time separately.
I selected the most recent data, from seasons 1 to 4 of 2023, and grouped it by county/city so it can be plotted on the map. This StrokeRehabGeographic subset helps us analyze the current state of post-stroke rehabilitation in each county/city.
#install.packages("tidyverse")
library(tidyverse)
StrokeRehabGeographic <- MainData %>% filter(Year == 2023)
StrokeRehabGeographic <- StrokeRehabGeographic %>% group_by(COUNTYCODE) %>%
summarise(
COUNTYNAME = first(COUNTYNAME),
StrokeRehabNumber = sum(StrokeRehabNumber),
StrokeNumber = sum(StrokeNumber),
StrokeRehabRate = sum(StrokeRehabNumber) / sum(StrokeNumber)
)
summary(StrokeRehabGeographic)
## COUNTYCODE COUNTYNAME StrokeRehabNumber StrokeNumber
## Length:22 Length:22 Min. : 7 Min. : 28
## Class :character Class :character 1st Qu.: 2501 1st Qu.: 3654
## Mode :character Mode :character Median : 4040 Median : 5674
## Mean : 7980 Mean :11148
## 3rd Qu.:13812 3rd Qu.:19655
## Max. :25731 Max. :38239
## StrokeRehabRate
## Min. :0.2500
## 1st Qu.:0.6788
## Median :0.7116
## Mean :0.6846
## 3rd Qu.:0.7468
## Max. :0.7963
To select hospitals with complete data for all years and seasons, I transformed the main dataset into a wide format using pivot_wider() and used na.omit() to remove hospitals with missing values. I then used the HospitalID values from this filtered table to filter MainData again, creating the StrokeRehabTime subset. This subset lets us fairly analyze the trend of post-stroke rehabilitation over time.
library(tidyr)
library(dplyr)
StrokeRehabTime <- MainData[c("Year", "Season", "HospitalID", "HospitalClass",
"StrokeRehabRate", "COUNTYCODE", "COUNTYNAME")]
StrokeRehabTime <- StrokeRehabTime %>% pivot_wider(
names_from = c(Year, Season),
values_from = StrokeRehabRate,
names_sep = "_"
)
#Check for and drop hospitals that are missing data at any time point during the study
#colSums(is.na(StrokeRehabTime))
StrokeRehabTime <- na.omit(StrokeRehabTime)
StrokeRehabTime <- MainData %>% filter(HospitalID %in% StrokeRehabTime$HospitalID)
summary(StrokeRehabTime)
## Year Season HospitalID HospitalClass
## Min. :2016 Min. :1.00 Min. :1.011e+08 Medical Center : 704
## 1st Qu.:2018 1st Qu.:1.75 1st Qu.:5.120e+08 Regional Hospital:2048
## Median :2020 Median :2.50 Median :1.102e+09 Local Hospital :1312
## Mean :2020 Mean :2.50 Mean :8.902e+08
## 3rd Qu.:2021 3rd Qu.:3.25 3rd Qu.:1.142e+09
## Max. :2023 Max. :4.00 Max. :1.543e+09
## StrokeRehabNumber StrokeNumber StrokeRehabRate StrokeRehabRegionRate
## Min. : 0.0 Min. : 1.00 Min. :0.0000 Min. :0.5760
## 1st Qu.: 12.0 1st Qu.: 19.00 1st Qu.:0.6026 1st Qu.:0.6921
## Median : 29.0 Median : 41.00 Median :0.7200 Median :0.7168
## Mean : 40.5 Mean : 56.51 Mean :0.6999 Mean :0.7166
## 3rd Qu.: 56.0 3rd Qu.: 79.25 3rd Qu.:0.8113 3rd Qu.:0.7428
## Max. :229.0 Max. :265.00 Max. :1.0000 Max. :0.7981
## StrokeRehabCountryRate COUNTYCODE COUNTYNAME
## Min. :0.6686 Length:4064 Length:4064
## 1st Qu.:0.6959 Class :character Class :character
## Median :0.7091 Mode :character Mode :character
## Mean :0.7153
## 3rd Qu.:0.7385
## Max. :0.7581
In 2023, an ANOVA of post-stroke rehabilitation rates grouped by county/city revealed significant differences (p-value = 0.0154).
library(dplyr)
anova_result <- aov(StrokeRehabRate ~ COUNTYNAME,
data = MainData[MainData$Year == 2023,])
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## COUNTYNAME 21 2.69 0.12809 1.798 0.0154 *
## Residuals 800 56.99 0.07124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The boxplot shows the distribution of post-stroke rehabilitation rates across counties and cities.
There are indeed differences between them: Lienchiang County and Penghu County had median rehabilitation rates below 50%, suggesting that the offshore islands lack sufficient stroke rehabilitation resources.
library(ggplot2)
ggplot(MainData[MainData$Year == 2023,],
aes(x = COUNTYNAME, y = StrokeRehabRate, fill = COUNTYNAME)) +
geom_boxplot(notch = TRUE) +
labs(title = "Taiwan Stroke Rehabilitation Rate by County/City",
x = "County/City",
y = "Stroke Rehab Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
The interactive map is drawn with the tmap library. Lighter colors indicate lower post-stroke rehabilitation rates. You can zoom with the mouse scroll wheel and click on the map to see the rehabilitation rate in each county/city. (In the HTML slide format, clicking may advance to the next slide; press ← to return to the current slide and see the value.)
As the map shows, in addition to Lienchiang County and Penghu County, Taitung County, Tainan City, Chiayi County, Chiayi City, Hsinchu County, Keelung City, and, surprisingly, the capital of Taiwan, Taipei City, have relatively low post-stroke rehabilitation rates.
library(tmap)
library(dplyr)
TaiwanMapGeographic <- left_join(TaiwanMap, StrokeRehabGeographic, by = c("COUNTYCODE","COUNTYNAME"))
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(TaiwanMapGeographic) +
tm_polygons("StrokeRehabRate",
palette = "Blues",
#set the popup word with mouse clicking
popup.vars = c("COUNTYNAME", "StrokeRehabRate")) +
tm_layout(title = "Taiwan Stroke Rehabilitation Rate by County/City",
legend.title.size = 1.2,
legend.text.size = 0.8) +
tm_view(view.legend.position = c("right", "bottom"))